Abstract

The Vapnik-Chervonenkis dimension is a combinatorial parameter that
reflects the "complexity" of a set of sets (a.k.a. a concept class). It was
introduced by Vapnik and Chervonenkis in their seminal paper [1]
and has since found many applications, most notably in machine learning
theory and in computational geometry. Arguably the most influential
consequence of the VC analysis is the fundamental theorem of statistical
machine learning, stating that a concept class is learnable (in some precise
sense) if and only if its VC-dimension is finite. Furthermore, for such
classes a most simple learning rule - empirical risk minimization (ERM) -
is guaranteed to succeed.

The simplest non-trivial structures, in terms of the VC-dimension, are 
the classes (i.e., sets of subsets) for which that dimension is 1. 

In this note we show a couple of curious results concerning such classes. 
The first result shows that such classes share a very simple structure, and, 
as a corollary, the labeling information contained in any sample labeled 
by such a class can be compressed into a single instance. 

The second result shows that, due to some subtle measurability issues,
in spite of the above-mentioned fundamental theorem, there are classes of
dimension 1 for which an ERM learning rule fails miserably.


1 Preliminaries: The Vapnik-Chervonenkis dimension

Definition 1 ([1]). Let $X$ be any set, and let $2^X$ denote its power set - the set of all
subsets of $X$. A concept class is a set of subsets of $X$, $H \subseteq 2^X$. We will identify
subsets of $X$ with binary valued functions over $X$ (a function $h : X \to \{0,1\}$ is
identified with the set $\{x \in X : h(x) = 1\}$).

• $H$ shatters $A \subseteq X$ if $\{h \cap A : h \in H\} = 2^A$. Note that for a finite $A$ this
is equivalent to $|\{h \cap A : h \in H\}| = 2^{|A|}$.

• The Vapnik-Chervonenkis dimension of $H$ is defined as $VCdim(H) =
\sup\{|A| : H \text{ shatters } A\}$.

Footnote (to the abstract): I discovered the results presented in this note more than 20 years ago, and have
mentioned them in public talks as well as private communications over the years. However,
this is the first time I have written them up for publication.

This note focuses on classes whose VC-dimension is 1. The following simple 
claim is well known and can be easily verified. 

Claim 1. Let $H$ be a class over some domain set $X$. If there exists a linear
order $\preceq$ over $X$ such that every member $h$ of $H$ is an initial segment w.r.t. that
order (namely, for all $x, y \in X$, if $x \preceq y$ and $h(y) = 1$ then $h(x) = 1$) then
$VCdim(H) \leq 1$.
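
For a finite class over a finite domain, Definition 1 and Claim 1 can be checked by brute force. The following is a minimal sketch, not part of the original note: the function names `shatters` and `vc_dim` and the example class are illustrative assumptions, and concepts are represented as Python sets.

```python
from itertools import combinations


def shatters(concepts, points):
    """True if the traces {h & points : h in concepts} are all subsets of points."""
    points = frozenset(points)
    traces = {frozenset(h) & points for h in concepts}
    return len(traces) == 2 ** len(points)


def vc_dim(concepts, domain):
    """Brute-force VC dimension of a finite, non-empty class over a finite domain."""
    domain = list(domain)
    dim = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(concepts, a) for a in combinations(domain, k)):
            dim = k
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return dim


# Initial segments of the usual order on {0, ..., 4}: {}, {0}, {0, 1}, ...
initial_segments = [set(range(k)) for k in range(6)]
print(vc_dim(initial_segments, range(5)))  # prints 1
```

No pair $\{a, b\}$ with $a < b$ is shattered here, because no initial segment contains $b$ without containing $a$; this is exactly the argument behind Claim 1.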

Definition 2. For functions $h, f : X \to \{0,1\}$, the f-representation of $h$ is the
set $h_f = \{x \in X : h(x) \neq f(x)\}$. Note that if $f$ is the constant 0 function then
$h_f$ is just the usual set equivalent of the function $h$. For a class of functions $H$
and $f : X \to \{0,1\}$, we define the f-representation of $H$ as $H_f = \{h_f : h \in H\}$.

Note that for any concept class $H$, and any binary valued $f$ as above, $VCdim(H) =
VCdim(H_f)$.
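
The invariance of the VC-dimension under f-representations can be checked exhaustively on a small example. This is an illustrative sketch only: the class `H` below is an arbitrary assumption, functions are represented by the sets on which they take the value 1, and $h_f$ is then the symmetric difference of $h$ and $f$.

```python
from itertools import combinations


def vc_dim(concepts, domain):
    """Brute-force VC dimension of a finite, non-empty class of sets."""
    domain, concepts = list(domain), [frozenset(h) for h in concepts]
    dim = 0
    for k in range(1, len(domain) + 1):
        if any(len({h & frozenset(a) for h in concepts}) == 2 ** k
               for a in combinations(domain, k)):
            dim = k
        else:
            break
    return dim


def f_representation(h, f):
    """h_f = {x : h(x) != f(x)}, i.e. the symmetric difference of h and f."""
    return frozenset(h) ^ frozenset(f)


domain = range(4)
H = [set(), {0}, {0, 1}, {2}, {1, 3}]  # arbitrary example class
all_f = [set(s) for k in range(5) for s in combinations(domain, k)]
invariant = all(
    vc_dim([f_representation(h, f) for h in H], domain) == vc_dim(H, domain)
    for f in all_f
)
```

The reason the check succeeds for every $f$ is that, restricted to any set $A$, the map $h \mapsto h_f$ is a bijection between traces on $A$, so $H$ and $H_f$ shatter the same sets.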

The VC dimension plays a major role in machine learning theory. We discuss
this aspect some more in Section 3.1.

2 A structure theorem for classes of VCdim 1 

In this section we show that classes of VC-dimension 1 are in fact very simple.
We have already mentioned, in Claim 1, that if the sets in a class $H$ are linearly
ordered by inclusion, then $VCdim(H) \leq 1$. This claim can be somewhat
extended by noting that one does not really need a linear order. In fact, having
the inclusion partial ordering of the members of $H$ being a tree suffices to imply
the same conclusion. This is formalized by the following.

Definition 3. We say that a partial order $\preceq$ over some set $X$ is a tree ordering
if, for every $x \in X$, the initial segment $I_x = \{y : y \preceq x\}$ is linearly ordered
(under $\preceq$).

Claim 2. Let $H$ be a class over some domain set $X$. If there exists a tree
ordering $\preceq$ over $X$ such that every member $h$ of $H$ is an initial segment w.r.t.
that order (namely, for all $x, y \in X$, if $x \preceq y$ and $h(y) = 1$ then $h(x) = 1$) then
$VCdim(H) \leq 1$.

Proof. The proof of that claim is simple - □ 

We will now show that any class having VC-dimension 1 has such a structure. 
Theorem 4. Let $H$ be a concept class over some domain $X$. The following
statements are equivalent:

1. $VCdim(H) \leq 1$.

2. There exists some tree ordering over $X$ and a representation $f : X \to
\{0,1\}$ such that every element of $H_f$ is an initial segment under that
ordering relation.

Proof. 2 implies 1: Just note that if every member of $H_f$ is an initial segment
under $\preceq$ then, for any $x_1, x_2 \in X$, if there exists some $h \in H_f$ such that $h(x_1) =
h(x_2) = 1$ then it must be the case that either $x_1 \preceq x_2$ or $x_2 \preceq x_1$. However,
in the first case there exists no $h' \in H_f$ such that $h'(x_1) = 0$ and $h'(x_2) = 1$, and
in the second case there exists no $h' \in H_f$ such that $h'(x_2) = 0$ and $h'(x_1) = 1$;
therefore the set $\{x_1, x_2\}$ is not shattered by $H_f$. Since $VCdim(H) = VCdim(H_f)$,
it follows that $VCdim(H) \leq 1$.

1 implies 2: Assume, w.l.o.g., that for every $x \neq y \in X$, there exists some
$h \in H$ so that $h(x) \neq h(y)$. Pick some $f \in H$ and consider the partial ordering
defined by

$$\leq_f \; = \; \{(x, y) : \forall h \in H, \; h(y) \neq f(y) \Rightarrow h(x) \neq f(x)\}.$$

Lemma 5 shows that this is indeed a tree ordering. The proof is concluded
by noting that the definition of the relation $\leq_f$ implies that for every $h \in H$,
the set $h_f$ (namely, $\{x : h(x) \neq f(x)\}$) is an initial segment w.r.t. $\leq_f$. □

Lemma 5. $\leq_f$ is a partial ordering. Namely, it is reflexive, transitive and
antisymmetric. Furthermore, the assumption that $VCdim(H) \leq 1$ implies that $\leq_f$
is a tree ordering.

Proof. • Being reflexive and transitive follows trivially from the definition. 

• For anti-symmetry, let $x, y$ be such that both $x \leq_f y$ and $y \leq_f x$ hold.
It is easy to see that this implies that for all $h \in H$, $h(x) = h(y)$.

• Assume, by way of contradiction, that $\leq_f$ is not a tree ordering. This
means that for some $x \in X$ there exist $y, z$ so that $y \leq_f x$, $z \leq_f x$ but
neither $y \leq_f z$ nor $z \leq_f y$ holds. Let us show that in such a case the
pair $\{y, z\}$ is shattered by $H$ (and thus $VCdim(H) \geq 2$, contradicting our
assumption). Pick $h_1 \in H$ for which $h_1(x) \neq f(x)$ (such $h_1$ exists by
our assumption that for every $x \in X$, each of the labels $\{0,1\}$ is given
by some $h \in H$). The definition of $\leq_f$ implies now that $h_1(y) \neq f(y)$
and $h_1(z) \neq f(z)$. The incomparability of $y, z$ implies the existence of
$h_2, h_3 \in H$ such that $h_2(y) = f(y)$ and $h_2(z) \neq f(z)$, and those labels are
flipped for $h_3$. It follows that $\{h_1, h_2, h_3, f\}$ shatter $\{y, z\}$ and since we
picked $f \in H$, it follows that $H$ also shatters $\{y, z\}$.

□ 
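
The construction in the proof can be traced on a small finite example. The sketch below is only an illustration under stated assumptions: the class `H` and the choice of `f` are hypothetical, functions are represented by the sets on which they equal 1, and "initial segment" is read as in the proof of Theorem 4, namely a downward closed, linearly ordered set.

```python
def leq_f(x, y, concepts, f):
    """x <=_f y: every h that disagrees with f on y also disagrees with f on x."""
    return all((x in h) != (x in f) for h in concepts if (y in h) != (y in f))


def is_tree_ordering(domain, leq):
    """Check that every initial segment {y : y <= x} is linearly ordered."""
    for x in domain:
        below = [y for y in domain if leq(y, x)]
        if not all(leq(y, z) or leq(z, y) for y in below for z in below):
            return False
    return True


def is_initial_segment(s, domain, leq):
    """Here: s is downward closed and linearly ordered under leq."""
    down_closed = all(x in s for y in s for x in domain if leq(x, y))
    chain = all(leq(x, y) or leq(y, x) for x in s for y in s)
    return down_closed and chain


domain = [0, 1, 2, 3]
H = [set(), {0}, {0, 1}, {0, 2}, {0, 2, 3}]  # hypothetical class with VCdim 1
f = {0, 1}                                   # a member of H


def leq(x, y):
    return leq_f(x, y, H, f)


H_f = [h ^ f for h in H]  # f-representations of the members of H
tree_ok = is_tree_ordering(domain, leq)
segments_ok = all(is_initial_segment(s, domain, leq) for s in H_f)
```

In this example the point 1 becomes the root of the tree, with 0 and 2 directly above it and 3 above 2. Running the same check on the full power set of a two-point domain fails, as it must for a class of VC-dimension 2.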
2.1 Sample compression for classes of VCdim 1

The above structure theorem has a nice implication for the issue of sample 
compression schemes. 

Definition 6. A sample compression scheme of size $d$ for a class $H$ is a pair
of functions, $(F, G)$, such that $F$ maps samples $S$ from $\bigcup_{m \in \mathbb{N}} (X \times \{0,1\})^m$ to
samples $F(S) \in \bigcup_{0 \leq m \leq d} (X \times \{0,1\})^m$ such that for any such $S$, if there exists
some $h \in H$ that is consistent with $S$ (namely, for all $(x, y) \in S$, $h(x) = y$) then
$F(S) \subseteq S$, and $G : \bigcup_{0 \leq m \leq d} (X \times \{0,1\})^m \to 2^X$ such that for any such $S$ and any
$(x, y) \in S$, $G(F(S))(x) = y$.

A sample compression scheme is called unlabeled if for every $S$, $F(S)$ consists
of just a subset of $X$ (of elements appearing in $S$), without their labels.

Sample compression schemes were introduced by Littlestone and Warmuth
[5], and a long standing open problem is the conjecture that there is some constant
$C$ such that every concept class of finite VC-dimension has a sample compression
scheme of size $C \cdot VCdim(H)$.

Theorem 4 readily implies that every class of VC dimension 1 has an
unlabeled sample compression scheme of size 1 as follows:

Given a class $H$ such that $VCdim(H) = 1$, let $f$ be a member
of $H$ and $\leq_f$ as in the proof of Theorem 4. For a sample $S =
((x_1, h(x_1)), \ldots, (x_m, h(x_m)))$ (for some $h \in H$), let

$F(S)$ = the $\leq_f$-maximal element in $\{x_i : i \leq m \text{ and } h(x_i) \neq f(x_i)\}$

(and $F(S) = \emptyset$ if $\{x_i : i \leq m \text{ and } h(x_i) \neq f(x_i)\} = \emptyset$).

Let $G$ be the function that on input $x$ outputs the function $G(x)$ so
that on input $y \in X$, if $y \leq_f x$ then $G(x)(y) = 1 - f(y)$, and for
any other $y$ (in particular, whenever $x <_f y$), $G(x)(y) = f(y)$. On the
empty input, $G(\emptyset) = f$.

It is easy to verify that the pair $(F, G)$ is a size 1 unlabeled compression
scheme for $H$.
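
The scheme can be exercised on a finite example. The following sketch is illustrative rather than definitive: the names `compress` and `reconstruct` stand for $F$ and $G$, `None` stands for the empty compression set, and the class `H` and reference function `f` are hypothetical choices with $VCdim(H) = 1$.

```python
from itertools import combinations


def leq_f(x, y, concepts, f):
    """x <=_f y: every h that disagrees with f on y also disagrees with f on x."""
    return all((x in h) != (x in f) for h in concepts if (y in h) != (y in f))


def compress(sample, concepts, f):
    """F: keep the <=_f-maximal sample point whose label differs from f."""
    disagree = [x for x, label in sample if label != int(x in f)]
    if not disagree:
        return None  # empty compression set: the sample agrees with f
    for x in disagree:
        if all(leq_f(y, x, concepts, f) for y in disagree):
            return x
    raise ValueError("no maximal element; sample is not labeled by the class")


def reconstruct(x, concepts, f):
    """G: flip f exactly on the points below x; return f itself if x is None."""
    def hypothesis(y):
        below = x is not None and leq_f(y, x, concepts, f)
        return 1 - int(y in f) if below else int(y in f)
    return hypothesis


domain = [0, 1, 2, 3]
H = [set(), {0}, {0, 1}, {0, 2}, {0, 2, 3}]  # hypothetical class with VCdim 1
f = {0, 1}                                   # reference member of H


def scheme_is_correct():
    """Check G(F(S)) against every sample labeled by some h in H."""
    for h in H:
        for k in range(len(domain) + 1):
            for xs in combinations(domain, k):
                sample = [(x, int(x in h)) for x in xs]
                g = reconstruct(compress(sample, H, f), H, f)
                if any(g(x) != y for x, y in sample):
                    return False
    return True
```

The check relies on two facts from the proof of Theorem 4: the sample points on which $h$ disagrees with $f$ form a chain under $\leq_f$, so a maximal one exists, and any sample point lying below that maximal point must itself be a disagreement point.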


3 ERM may fail to learn VC dimension 1 Classes 

3.1 More preliminaries 

The probability setup: For any given domain set $X$, we will consider
probability distributions over $X \times \{0,1\}$. Given such a probability distribution $P$,
we define the induced labeling rule as the function $\ell_P : X \to [0,1]$ defined
by $\ell_P(x) = P(y = 1 \mid x)$, and the induced marginal distribution, $D_P$, as the
projection of $P$ on $X$. We will identify $P$ with the pair $(D_P, \ell_P)$.
Definition 7 (0-1 loss). 1. For a probability distribution $P$ over $X \times \{0,1\}$
and $h : X \to \{0,1\}$,

$$L_P(h) = P[\{(x, y) : h(x) \neq y\}].$$

2. For a finite $S \subseteq X \times \{0,1\}$ and $h : X \to \{0,1\}$,

$$L_S(h) = \frac{|\{(x, y) \in S : h(x) \neq y\}|}{|S|}.$$

Definition 8. A class $H$, over some domain set, $X$, has the Uniform
Convergence Property (UCP) with respect to a family of probability distributions $\mathcal{P}$
over $X \times \{0,1\}$, if for every $\epsilon > 0, \delta > 0$ there exists some $m_H(\epsilon, \delta) \in \mathbb{N}$ such
that for every $P \in \mathcal{P}$, $m \geq m_H(\epsilon, \delta)$ implies that

$$\Pr_{S \sim P^m}\Big[\sup_{h \in H} |L_S(h) - L_P(h)| > \epsilon\Big] < \delta.$$

Learning and Empirical Risk Minimization: A learning rule is a function
that takes labeled samples as input and outputs a classifier. Formally, it is a
function $A : \bigcup_{m \in \mathbb{N}} (X \times \{0,1\})^m \to 2^X$.

Definition 9. 1. A learning rule $A$ is an Empirical Risk Minimizer (ERM)
for some class $H$, if $A(S) \in \mathrm{argmin}\{L_S(h) : h \in H\}$, for every $S \in
\bigcup_{m \in \mathbb{N}} (X \times \{0,1\})^m$.

2. A learning rule $A$ is a Probably Approximately Correct (PAC) learner
for some class $H$ w.r.t. some measurable space $(X, \Omega)$, if for every
$\epsilon > 0, \delta > 0$ there exists some $m_H(\epsilon, \delta) \in \mathbb{N}$ such that for every probability
measure $P$ over $(X, \Omega) \times \{0,1\}$, $m \geq m_H(\epsilon, \delta)$ implies that

$$\Pr_{S \sim P^m}\Big[\sup_{h \in H} \big(L_P(A(S)) - L_P(h)\big) > \epsilon\Big] < \delta.$$

It is common to omit the σ-algebra of measurable sets, $\Omega$, from the
notation. It is implicitly assumed to be the full power set of $X$ if $X$ is finite or
countably infinite, or the Lebesgue σ-algebra when $X$ is a subset of some
Euclidean space.
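
For a finite class, the empirical loss of Definition 7 and an ERM rule in the sense of Definition 9 can be written down directly. This is a minimal sketch under assumptions of our own: the class `H`, the target and the sample are hypothetical, and ties between minimizers are broken by list order, which is only one of many valid ERM rules.

```python
def empirical_loss(h, sample):
    """L_S(h): fraction of labeled points (x, y) in the sample with h(x) != y."""
    return sum(int(x in h) != y for x, y in sample) / len(sample)


def erm(concepts, sample):
    """Return some member of the class with minimal empirical loss on the sample."""
    return min(concepts, key=lambda h: empirical_loss(h, sample))


# Hypothetical finite class: initial segments {0, ..., k-1} of {0, ..., 9}.
H = [frozenset(range(k)) for k in range(11)]
target = frozenset(range(4))
sample = [(x, int(x in target)) for x in (1, 3, 6, 8)]
chosen = erm(H, sample)
```

Several members of `H` have zero empirical loss on this sample, and Definition 9 allows an ERM rule to return any of them. That freedom in choosing among minimizers is what the construction in Section 3.2 exploits.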

The following claim is well known and can be easily verified. 

Claim 3. A class $H$ has the Uniform Convergence Property with respect to the
family of probability distributions $\mathcal{P}$ over $(X, \Omega) \times \{0,1\}$ if and only if any ERM
learning function is a PAC learner for $H$.

The following is a seminal result that, in a sense, spearheaded modern
machine learning theory.

Theorem 10 (Vapnik-Chervonenkis 1971 [1]). A class $H$ has the uniform
convergence property if and only if its Vapnik-Chervonenkis dimension is finite.

In their proof, Vapnik and Chervonenkis invoke a subtle measurability
assumption. The vast literature on PAC style learning that followed that paper
often fails to mention that assumption. In the next section we show that such
a condition is indeed necessary.
3.2 A class of VCdim 1 for which ERM fails

Let our domain set be the real unit interval $[0,1]$ and let $U$ be the uniform
(Lebesgue) measure over it.

Theorem 11. Assuming the continuum hypothesis, there exists a class $H$ of VC
dimension 1 such that, for some probability distribution $P$ over its domain and for
some classifier $h \in H$, some empirical risk minimization (ERM) rule fails badly
when trained over samples generated by $P$ and labeled by $h$. More concretely,
for any sample size $m$, with probability 1 over samples of that size, the error of
that rule, when applied to the sample, will be 1.

Furthermore, it is a class of measurable subsets of the unit interval, and the 
probability distribution with respect to which it fails is the uniform distribution 
over that interval. 

Proof. Recall that the continuum hypothesis states that $2^{\aleph_0} = \aleph_1$ (in other
words, that every infinite subset of the reals is either countable or can be mapped
onto the full set of reals). It is well known that this assumption implies (in fact, is
equivalent to) the existence of a well ordering, $\prec$, over $[0,1]$, so that every initial
segment is countable (for every $r \in [0,1]$, the set $\{s : s \preceq r\}$ is countable).
Given such an ordering define a class of subsets $H = \{[0,1]\} \cup \{h_r : r \in [0,1]\}$,
where, for each real number $r$, $h_r = \{s : s \preceq r\}$.

Note that, by the choice of the ordering relation, every set in $H$ is either
countable or equals the unit interval. Therefore each member of the class $H$ is
Lebesgue measurable.

Furthermore, since $\preceq$ is a linear ordering over the real interval, for every $s \preceq t$,
$h_s \subseteq h_t$ (and every $h_s$ is a subset of the set $[0,1]$). It follows that $VCdim(H) = 1$.

We will now show that there is an ERM learning algorithm for the class $H$
that fails badly. Define the learning rule $A$ as follows:

Given any finite sample $S = ((r_1, y_1), (r_2, y_2), \ldots, (r_m, y_m))$ labeled
according to some $h \in H$, let $r^* = \max_{\preceq}\{r_i : (r_i, 1) \in S\}$ and define
$A(S) = h_{r^*}$.

Pick $t = [0,1]$ as the target classifier. That is, the labeling rule that assigns
the value 1 to every instance. Every training sample has, therefore, the form
$S = ((r_1, 1), (r_2, 1), \ldots, (r_m, 1))$. By the above definition of the ERM learning
rule we consider, for any such sample $S$, $L_S(A(S)) = 0$. However, since $A(S) =
h_r$ for some $r \in [0,1]$, and $h_r = \{s : s \preceq r\}$, by our choice of the ordering
relation $A(S)$ is a countable set (that is, it assigns the label 1 only to countably
many instances). It follows that, for the uniform distribution, $U$, $L_{U,t}(A(S)) =
U([0,1] \setminus A(S)) = 1$ (where, for a marginal probability distribution, $D$, and
labeling rules $t, h : X \to \{0,1\}$, $L_{D,t}(h) = D[\{x : h(x) \neq t(x)\}]$). In other words,
for every sample size, $m$, with probability 1 over i.i.d. samples, $S$, of that size
generated by $U$ and labeled by $t$, the 0-1 loss of $A(S)$ is 1. □

How come the above example does not contradict the Vapnik-Chervonenkis
characterization of ERM learnability in terms of the VC-dimension (a.k.a. the
Fundamental Theorem of Statistical Machine Learning)? The devil is, of course,
in measurability issues. The common proof of that fundamental theorem goes
through the double sample trick.

To prove the theorem one needs to upper bound the probability of the set
of samples for which an ERM learner fails. Namely

$$\Pr_{S \sim P^m}\big[\exists h \in H \text{ such that } L_S(h) = 0 \text{ but } L_P(h) > \epsilon\big].$$

This is usually done by upper bounding that probability by

$$2 \Pr_{(S,T) \sim P^{2m}}\big[\exists h \in H \text{ such that } L_S(h) = 0 \text{ but } L_T(h) > \epsilon\big]$$

(and, for any $H$ with a finite VC-dimension, this probability can be shown
to go to zero as $m$ goes to $\infty$, based on Sauer's lemma).

However, for such an argument to go through, it should be the case that the last
probability exists. Namely, that the set

$$A^m_\epsilon(H) = \{(S, T) \in (X \times \{0,1\})^{2m} : \exists h \in H \text{ such that } L_S(h) = 0 \text{ but } L_T(h) > \epsilon\}$$

is measurable under the product measure $P^{2m}$.

In the case of the example used to prove Theorem 11, that last measurability
requirement fails already for $m = 1$. To see that, note that

$$A^1_\epsilon(H) = \{(x, y) : x \prec y\}$$

(when we fix the labeling function that labels $S$ and $T$ to be the constant 1
function).

Recall that in the above example, the domain set is X = [0,1] and the 
underlying probability distribution is the uniform distribution, or equivalently, 
the Lebesgue measure. 

The fact that $\{(x, y) : x \prec y\}$ is not measurable under the Lebesgue
measure over $[0,1]^2$ follows from the failure of Fubini's integration lemma for the
(characteristic function of) that set:

$$\int_{y=0}^{1} \left( \int_{x=0}^{1} \mathbf{1}_{x \prec y} \, dx \right) dy = \int_{y=0}^{1} 0 \, dy = 0$$

whereas

$$\int_{x=0}^{1} \left( \int_{y=0}^{1} \mathbf{1}_{x \prec y} \, dy \right) dx = \int_{x=0}^{1} 1 \, dx = 1.$$

The first equation holds since for every $y$, $\{x : x \prec y\}$ is countable, so
$\int_{x=0}^{1} \mathbf{1}_{x \prec y} \, dx = 0$ for any $y$. The second equation holds since for every $x$,
$\{y : x \prec y\}$ is co-countable, so $\int_{y=0}^{1} \mathbf{1}_{x \prec y} \, dy = 1$ for any $x$.